(54) Title: PROCESS FOR IDENTIFYING AUDIO CONTENT

(57) Abstract: A fingerprint of an audio signal is generated based on the energy content in frequency subbands.
Processing techniques assure a robust identification fingerprint that will be useful for signals altered
subsequent to the generation of the fingerprint. The fingerprint is compared to a database to identify the
audio signal.

PROCESS FOR IDENTIFYING AUDIO CONTENT



BACKGROUND OF THE INVENTION 
The present invention relates to techniques for automatically identifying
musical pieces by monitoring the content of an audio signal.

Several techniques have been devised in the past to achieve this goal.
Many of the techniques rely on side information extracted, for example, from sideband 
modulation (in FM broadcast) or depend on inaudible signals (watermarks) having been 
inserted in the material being played. 
A few patents describe techniques that seek to solve the problem
without any side information, by extracting "fingerprints" from the song
itself, see e.g.:

• Patent (US4230990), which describes a system that relies on a frequency domain
analysis of the signal, but also requires the presence of a predetermined "signaling
event", such as a short single-frequency tone, in the audio or video signal.

• Patent (US3919479), which describes a system designed to identify commercials in
TV broadcasts. The system extracts a low-frequency envelope signal and correlates it
with signals in a database.

However, the systems described in these patents suffer significant
drawbacks:

• The fingerprint matching technique is usually based on a cross-correlation,
which is typically a costly process and is impractical when
large databases of fingerprints are to be used.

• The fingerprint which is extracted from the signal is not very robust to
signal alterations such as coding artifacts, distortion, spectral
coloration, reverberation and other effects that might have been added
to the material.

Accordingly, simple identification techniques that are robust to signal alterations are
required.



SUMMARY OF THE INVENTION 
According to one aspect of the invention, musical pieces (e.g., a given
song by a given artist) can be automatically identified by monitoring the content of the
audio signal. A typical example is a device that continuously listens to a radio broadcast,
and is able to identify the music being played without using any side information or
watermarking technique, i.e., the signal being listened to was not preprocessed in any
manner (for example, to insert inaudible identifying sequences, as in watermarking).

According to another aspect of the invention, the method comprises the
acts of extracting a fingerprint from the first few seconds of audio, and then comparing
this fingerprint to those stored in a large database of songs. Because it is desired to
identify songs taken among a very large set (several hundreds of thousands), the
fingerprint matching process is extremely simple: it requires comparing the
fingerprint to several hundreds of thousands of candidates, and yields a reliable result
in a small amount of time (less than a second, for example).

According to another aspect of the invention, the identification process is
fairly robust to alterations that might be present in the signal, such as audio
coding/decoding artifacts, distortion, spectral coloration, reverberation and so on. These
alterations might be undesirable, for example resulting from defects in the coding or
transmission process, or might have been added on purpose (for example, reverberation or
dynamic range compression). In either case, these alterations do not prevent the
identification of the musical piece.

According to another aspect of the invention, subband energy signals,
having a magnitude in dB, are extracted from overlapping frames of the signal. A
difference signal is then generated for each subband. The magnitudes of the frequency
components of the difference signals of the subbands are used as a fingerprint.

According to another aspect of the invention, the subband energy signals
are smoothed so the fingerprint will still be useful to identify a signal that has been
subsequently altered. For example, the signal may have had reverb effects added.

According to another aspect of the invention, the fingerprint is compared 
to a fingerprint database to identify the audio signal. 

According to another aspect of the invention, local maxima of a selected
parameter of the audio signal are located and a fingerprint monitoring period is located
near a local maximum.

Other features and advantages of the invention will be apparent from the
following detailed description and appended drawings.

BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a flowchart of the fingerprint extraction algorithm.
Fig. 2 is a graph depicting local maxima of an audio signal's time varying
energy.



DESCRIPTION OF THE SPECIFIC EMBODIMENTS
Overview 

The general idea of the technique embodied by the present
invention includes the acts of analyzing the audio signal by use of a Short-Term Fourier
Transform, then forming a number of derived signals that represent the energy in dB in N
selected frequency bands. These energy signals are recorded for, say, the first 10 seconds
of the audio signal, yielding N energy signals of M points each. Each of these N signals
is then differentiated with respect to time, and the resulting signals undergo a Fourier
Transform, yielding N frequency-domain energy signals of M points each. The
magnitudes of the first few values of these frequency-domain signals are then extracted
and concatenated to form the fingerprint.

This fingerprint is then compared to fingerprints in a database, simply by
calculating the Euclidean norm of the difference between the fingerprint and each
database candidate. The database candidate which yields the smallest norm indicates the
identified musical piece.

As depicted in Fig. 1, a short-term Fourier transform is calculated on
the incoming signal 12, the magnitudes of the FFT bins 14 are summed 16 within
predefined frequency bands, and the results, expressed in dB 18, are processed by a
first-order difference filter 20 (and, optionally, by a non-linear smoothing filter 21). A second
FFT 22 is calculated on the first-order difference signal and magnitudes 24 are utilized as
the fingerprint. Each of these steps is described in detail below.

Extracting the time-domain subband energy signals 

The incoming signal is first analyzed by use of an overlapping short-term
Fourier transform: once every 256 samples, a 1024-sample frame of the signal is
extracted and multiplied by a weighting window, and then processed by a Fourier
transform. For each frame, the magnitudes of the FFT bins in N selected frequency bands
are summed, and the N results are expressed in dB. As a result, there are now N energy
values for each signal frame (i.e., every 256 samples), or in other words, N subband
energy signals expressed in dB.
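The framing and band-summing step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patented implementation: the function names, the Hann window, the `sample_rate` parameter, the band edges and the use of 20·log10 on the summed magnitudes are all assumptions, since the text names no specific window, sampling rate or dB convention.

```python
import cmath
import math

def hann(n):
    """Hann weighting window (an assumed choice; the text does not name a window)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def subband_energies_db(signal, sample_rate, bands, frame_len=1024, hop=256):
    """Return one list per band holding the band energy in dB for every frame.

    bands is a list of (low_hz, high_hz) tuples. The magnitudes of the DFT bins
    that fall inside each band are summed, then converted to dB.
    """
    window = hann(frame_len)
    bin_hz = sample_rate / frame_len
    # DFT bin indices belonging to each band; only these bins are evaluated.
    band_bins = [
        [k for k in range(frame_len // 2 + 1) if lo <= k * bin_hz < hi]
        for lo, hi in bands
    ]
    energies = [[] for _ in bands]
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        for b, bins in enumerate(band_bins):
            total = 0.0
            for k in bins:
                w = -2j * math.pi * k / frame_len
                total += abs(sum(x * cmath.exp(w * i) for i, x in enumerate(frame)))
            # Small offset avoids log10(0) on silent frames.
            energies[b].append(20 * math.log10(total + 1e-12))
    return energies
```

The direct DFT used here is slow but keeps the sketch dependency-free; a practical system would presumably use an FFT routine and evaluate all bins at once.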

In some cases, for example when desiring to develop a fingerprint that
identifies a signal having reverb added, the subband energy signals can be further
smoothed, for example by use of a non-linear exponential-memory filter 21. Denoting
E(f, n) the energy in subband f at frame n, filtered signals Ẽ(f, n) are defined by

\[
\tilde{E}(f,n) =
\begin{cases}
E(f,n) & \text{if } E(f,n) \ge \tilde{E}(f,n-1) \\
\alpha\,E(f,n) + (1-\alpha)\,\tilde{E}(f,n-1) & \text{otherwise}
\end{cases}
\]

where α is a smoothing parameter with a small value. This ensures that Ẽ(f, n)
closely follows E(f, n) during increasing segments, but is smoothed out when E(f, n)
decreases. When reverberation is added to a signal, energy levels tend to be sustained.
Reverbs usually have a short attack time and a long sustain, so smoothing is only done
on decay (E(f, n) decreasing) but not on attack. Thus, by smoothing only decreasing
subband components, a more robust fingerprint is obtained that will identify signals
having reverb added. For example, if the audio signal being identified has not been
altered, then its fingerprint will exactly match the fingerprint stored in the database. On
the other hand, if the audio signal being identified has reverb added, then the smoothing
filter will not change the energy curve significantly because the energy of the signal has
already been "smoothed" by the added reverb. Thus, the fingerprint derived from the
audio signal being analysed will closely match the stored fingerprint.
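The smoothing behaviour described here (follow rises immediately, smooth only the decays) can be sketched as a one-pole filter with an asymmetric update. This is an illustrative reading rather than the patent's exact filter: the function name, the default `alpha` and the choice to compare each input against the previous smoothed value are assumptions.

```python
def smooth_decay(energy_db, alpha=0.1):
    """Track rising energy exactly; apply exponential-memory smoothing on decay.

    energy_db is one subband energy signal (one value per frame, in dB).
    alpha is the smoothing parameter; a small value gives a slow decay.
    """
    smoothed = []
    prev = None
    for e in energy_db:
        if prev is None or e >= prev:
            prev = e                                # attack: follow the input
        else:
            prev = alpha * e + (1 - alpha) * prev   # decay: drift toward the input
        smoothed.append(prev)
    return smoothed
```

For instance, `smooth_decay([0, 10, 0, 0], alpha=0.5)` returns `[0, 10, 5.0, 2.5]`: the jump to 10 is followed at once, while the drop back to 0 is stretched over several frames, much as added reverberation would stretch it.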

In practice, frequency bands should be chosen that span a useful portion of
the frequency range, but prove to be the least affected by artifacts and signal alterations.
For example, 4 bands between 100 Hz and 2 kHz could be chosen.

The step of calculating the subband energy signals can also be done
entirely in the time-domain, by using bandpass filters tuned to the desired band,
downsampling the output and calculating its power.
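A time-domain variant could look like the sketch below. It is only one possible reading of this paragraph: the second-order band-pass section (coefficients in the widely used "Audio EQ Cookbook" form), the `q` value, the function names and the block-averaged power taken once per `hop` samples are all assumptions, since the text does not specify a filter design or a downsampling scheme.

```python
import math

def bandpass_biquad(signal, sample_rate, center_hz, q=2.0):
    """Second-order band-pass filter with unity gain at center_hz."""
    w0 = 2 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b0, b2 = alpha / a0, -alpha / a0            # b1 is zero for this design
    a1, a2 = -2 * math.cos(w0) / a0, (1 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def band_power_db(signal, sample_rate, center_hz, hop=256, q=2.0):
    """Band-pass the signal, then report its mean power in dB once per hop samples."""
    filtered = bandpass_biquad(signal, sample_rate, center_hz, q)
    powers = []
    for start in range(0, len(filtered) - hop + 1, hop):
        block = filtered[start:start + hop]
        mean_power = sum(v * v for v in block) / hop
        powers.append(10 * math.log10(mean_power + 1e-12))
    return powers
```

Calling `band_power_db` once per band yields one dB sequence per subband at the same rate as the frame-based approach (one value every 256 samples with the default `hop`).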

Calculating the Fourier transform of the energy signals 

At the end of the "monitoring period", for example 10 seconds after the
start of the audio signal, the N subband energy signals are processed by a first-order
difference, yielding subband "energy flux" signals δ_E(f, n).

Because Ẽ(f, n) is expressed in dB, this has the desirable effect of
discarding any constant-amplitude factor in the subband energy. In other words, two
signals that only differ by their amplitude will yield the same signals δ_E(f, n). Similarly,
two signals that only differ by a constant or slowly time-varying transfer function (the
second being obtained by filtering the first by a constant or slowly time-varying filter)
will yield very similar δ_E(f, n), which is very desirable.
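The amplitude invariance follows from the logarithm: scaling the power in every frame by a constant g adds the same 10·log10(g) to every dB value, and that offset cancels in the difference between neighboring frames. A small sketch with made-up power values (the function names and the numbers are illustrative assumptions):

```python
import math

def to_db(power):
    """Convert a linear power value to decibels."""
    return 10 * math.log10(power)

def energy_flux(energy_db):
    """First-order difference between the dB energies of neighboring frames."""
    return [b - a for a, b in zip(energy_db, energy_db[1:])]

# Hypothetical subband powers for four consecutive frames.
quiet = [1.0, 4.0, 2.0, 8.0]
# The same material played 20 dB louder (power scaled by 100).
loud = [100.0 * p for p in quiet]

flux_quiet = energy_flux([to_db(p) for p in quiet])
flux_loud = energy_flux([to_db(p) for p in loud])
```

Both flux sequences come out the same up to floating-point rounding, so a fingerprint built from them does not depend on playback level.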

A window is then applied to each subband energy flux signal, and the
Fourier transform of the result is taken, yielding N frequency-domain signals D_E(f, F)
(where f is the subband, and F is the frequency). Taking the magnitude of the result
ensures that the frequency-domain signals are somewhat robust to time-delay. In other
words, two signals that only differ by a relatively small delay (compared to the duration
of the "monitoring period") will yield very similar signals |D_E(f, F)|, which is highly
desirable, since in practical applications, there might not be a reliable reference for
time-aligning the signal.
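The effect of discarding the phase can be illustrated with a small sketch. The function name, the Hann window and the synthetic flux signal are assumptions chosen for illustration; the direct DFT stands in for whatever FFT routine a real system would use.

```python
import cmath
import math

def flux_spectrum_magnitude(flux, num_bins):
    """Window one energy-flux signal and return |DFT| for its lowest num_bins bins."""
    n = len(flux)
    windowed = [v * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, v in enumerate(flux)]
    mags = []
    for k in range(num_bins):
        w = -2j * math.pi * k / n
        mags.append(abs(sum(v * cmath.exp(w * i) for i, v in enumerate(windowed))))
    return mags

# Synthetic flux signal: two slow oscillations, 205 frames long (made-up data).
n = 200
base = [math.sin(2 * math.pi * 3 * i / n) + 0.5 * math.sin(2 * math.pi * 7 * i / n + 1.0)
        for i in range(n + 5)]

mag_a = flux_spectrum_magnitude(base[0:n], 10)        # reference segment
mag_b = flux_spectrum_magnitude(base[5:n + 5], 10)    # same signal, 5 frames later
```

For this deliberately periodic example the two magnitude vectors agree to within rounding error, because the delay only changes the phase of each component. Real flux signals are not periodic, so their magnitudes would be expected to agree only approximately, and less well as the delay grows.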



Forming the fingerprint 
The fingerprint is obtained by selecting the first few values of the
magnitude-only frequency-domain energy-flux signals |D_E(f, F)|, for k values of F close
to 0 Hz (for example, up to 6 or 7 Hz), in each subband f. A window can be applied to the
values so the magnitude decreases for increasing frequencies. This ensures that more
attention is paid to low frequencies (which describe the slow variations of the energy flux
signal) than to high frequencies (which describe the finer details of the energy flux
signal).

Concatenating the k values extracted from each of the N bands produces
the audio fingerprint. The fingerprint can additionally be quantized and represented using
a small number of bits (for example as an 8-bit word).
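Selecting, weighting, concatenating and quantizing can be sketched as below. The linear taper, the scaling to the largest value before 8-bit rounding and the function names are assumptions; the text only requires that the weighting decrease with frequency and that the result fit a small number of bits.

```python
def form_fingerprint(band_magnitudes, k):
    """Concatenate the first k magnitudes of every band after a decreasing weighting.

    band_magnitudes holds one list of spectral magnitudes per subband,
    ordered from the lowest modulation frequency upward.
    """
    weights = [1.0 - i / k for i in range(k)]   # linear taper from 1.0 down to 1/k
    fingerprint = []
    for mags in band_magnitudes:
        fingerprint.extend(m * w for m, w in zip(mags[:k], weights))
    return fingerprint

def quantize_8bit(fingerprint):
    """Map fingerprint values to integers in 0..255, relative to the largest value."""
    peak = max(fingerprint) or 1.0              # avoid dividing by zero
    return [round(255 * v / peak) for v in fingerprint]
```

With two bands `[[4.0, 2.0, 7.0], [5.0, 4.0, 7.0]]` and `k=2`, `form_fingerprint` returns `[4.0, 1.0, 5.0, 2.0]` and `quantize_8bit` maps that to `[204, 51, 255, 102]`. As a rough guide to sizing k: if the transform covers a 10-second monitoring period without zero-padding, its bins are spaced 0.1 Hz apart, so a 6 Hz limit would correspond to about 60 bins per band.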


Matching the fingerprint 

The fingerprint can then be compared to a database of fingerprints
extracted from known material. A simple Euclidean distance is calculated between the
fingerprint and the candidate fingerprints in the database. The candidate fingerprint that
corresponds to the smallest Euclidean distance indicates which material was played. The
value of the Euclidean distance also indicates whether there is a good match between the
fingerprint and the candidate (good recognition certainty) or whether the match is only
approximate.
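A minimal sketch of this matching step follows. The dictionary-based database, the function names and the optional `threshold` argument are illustrative assumptions; a large database would presumably need a more efficient search structure than this linear scan.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two fingerprints of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_match(fingerprint, database, threshold=None):
    """Return (title, distance) for the nearest database entry.

    database maps a title to its stored fingerprint. If threshold is given
    and the smallest distance exceeds it, the title is reported as None.
    """
    title, dist = min(
        ((name, euclidean_distance(fingerprint, cand)) for name, cand in database.items()),
        key=lambda item: item[1],
    )
    if threshold is not None and dist > threshold:
        return None, dist
    return title, dist
```

The returned distance doubles as a confidence measure: a value near zero suggests a reliable identification, while a larger value suggests that the nearest candidate is only an approximate match.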



The problem of time-alignment 

The technique described above is somewhat immune to mismatches in
time-alignment, but not entirely. If the segment of the audio being analysed does not
correspond exactly to the same segment in the database, the monitored fingerprint will be
slightly different from the fingerprint in the database, all the more different as the two
segments are further apart in time. In some cases, it is reasonable to expect that the
monitoring device will have some notion of where the beginning of the song is, in which
case the analyzed segment will correspond fairly well to the segment in the database. In
some other situations (for example, when monitoring a stream of audio without clear
breaks), the monitoring device will have no notion of where the beginning of the track is,
and the fingerprint matching will fail.

One way around that problem consists of monitoring the overall energy of
the signal (or some simple time-varying feature of the signal), identifying local maxima
and setting the time at the local maximum as the beginning of the monitoring period (the
period over which the fingerprint will be determined). For signals in the database, the
monitoring period could be located at one of the local maxima of the energy and the
fingerprint determined from that monitoring period. For the signal to be identified by the
device, the device could locate local maxima and check the corresponding fingerprints
with the database. Eventually, one of the local maxima will fall very near the local
maximum which was used in the database and will yield a very good fingerprint match,
while the other fingerprints taken at other local maxima will not fit well.

Identification when no time-reference is available is illustrated in Fig. 2.
The signal's time-varying energy is calculated and local maxima 50 are determined. For
the database fingerprints, the monitoring period 52 is located relative to one of the local
maxima. The monitoring device calculates fingerprints around local maxima of the
energy and matches them with the database. A good match is only obtained when the
local maximum is the one that was used for the database. This good match is used to
determine which song is being played. By detecting that a good match was obtained (for
example, because the Euclidean distance between the fingerprint and the best database
candidate is below a threshold), the device will be able to reliably identify the music piece
being played. The choice of the local maximum used in the database can be arbitrary.
One might want to pick one that is close to the beginning of the song, which would make
the identification faster (since the monitoring device will reach that local maximum
faster).
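The search over local maxima can be sketched as follows. The strict three-point peak test, the function names and the two callables passed to `identify` are hypothetical; they stand in for whatever energy tracker, fingerprint extractor and database lookup an actual device would use.

```python
def local_maxima(envelope):
    """Indices where the envelope is strictly greater than both of its neighbors."""
    return [
        i for i in range(1, len(envelope) - 1)
        if envelope[i - 1] < envelope[i] > envelope[i + 1]
    ]

def identify(envelope, fingerprint_at, database_match, threshold):
    """Try a fingerprint at each local maximum and stop at the first good match.

    fingerprint_at(start) -> fingerprint for a monitoring period starting at
                             frame index start (hypothetical callable).
    database_match(fp)    -> (title, distance) of the best database candidate
                             (hypothetical callable).
    Returns (title, start) on success, or (None, None) if nothing matched.
    """
    for start in local_maxima(envelope):
        title, dist = database_match(fingerprint_at(start))
        if dist <= threshold:
            return title, start
    return None, None
```

On a real signal the raw energy would likely need smoothing, or a minimum spacing between accepted peaks, so that the device and the database select the same maxima; those refinements are left out of this sketch.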

In a preferred embodiment, the invention is implemented in software,
stored on a computer-readable storage structure, and executed by a computing system which
may include a digital signal processor (DSP).

The invention has now been described with reference to the preferred
embodiments. Alternatives and substitutions will now be apparent to persons of ordinary
skill in the art. For example, different smoothing algorithms to add robustness for
different effects are known in the art. In addition, specific frequencies or ranges are
denoted for purposes of illustration, not limitation. Accordingly, it is not intended to limit
the invention except as provided by the appended claims.

CLAIMS

1. A method of identifying a digital audio signal by monitoring the
content of the audio signal, said method comprising the acts of:
    selecting a set of frequency subbands of said audio signal, with each
frequency having a selected frequency range;
    for each subband, generating subband energy signal having a magnitude,
in decibels (dB), equal to signal energy in the subband;
    forming an energy flux signal for each subband having a magnitude equal
to the difference between subband energy signals of neighboring frames;
    determining the magnitude of frequency components bins of the energy
flux signal for each subband;
    forming a fingerprint comprising the magnitudes of the frequency
component bins of the energy flux signal for all subbands; and
    comparing the fingerprint for the audio file to fingerprints in a database to
identify the audio file.

2. The method of claim 1 where said step of generating a subband
energy signal comprises the acts of:
    for each subband, filtering the audio signal to obtain a filtered signal
having only frequency components in the subband; and
    calculating the power of the filtered signal.

3. A method of generating a fingerprint for identifying an audio
signal, said method comprising the acts of:
    selecting a set of frequency subbands of said audio signal, with each
frequency having a selected frequency range;
    for each subband, generating subband energy signal having a magnitude,
in decibels (dB), equal to signal energy in the subband;
    forming an energy flux signal for each subband having a magnitude equal
to the difference between subband energy signals of neighboring frames;
    determining the magnitude of frequency components bins of the energy
flux signal for each subband; and
    forming a fingerprint comprising the magnitudes of the frequency
component bins of the energy flux signal for all subbands.

WO 01/88900 PCT/IB01/00982

4. The method of claim 3 where said step of generating a subband
energy signal comprises the acts of:
    dividing a segment of the signal into overlapping frames;
    for each frame, determining the magnitude of frequency bins at different
frequencies;
    selecting a set of frequency subbands of a desired frequency range;
    for each subband and each frame, summing the frequency bins of the
frame located within the subband to form a subband energy signal having a magnitude
expressed in decibels (dB) for the given frame.

5. The method of claim 3 where said step of generating a subband
energy signal comprises the acts of:
    for each subband, filtering the audio signal to obtain a filtered signal
having only frequency components in the subband; and
    calculating the power of the filtered signal.

6. A method of identifying a digital audio signal by monitoring the
content of the audio signal, said method comprising the acts of:
    dividing a segment of the signal into overlapping frames;
    for each frame, determining the magnitude of frequency bins at different
frequencies;
    selecting a set of frequency subbands of a desired frequency range;
    for each subband and each frame, summing the frequency bins of the
frame located within the subband to form a subband energy signal having a magnitude
expressed in decibels (dB) for the given frame;
    forming an energy flux signal for each subband having a magnitude equal
to the difference between subband energy signals of neighboring frames;
    determining the magnitude of frequency components bins of the energy
flux signal for each subband;
    forming a fingerprint comprising the magnitudes of the frequency
component bins of the energy flux signal for all subbands;
    comparing the fingerprint for the audio file to fingerprints in a database to
identify the audio file.

7. The method of claim 6 further comprising the acts of:
    smoothing the subband energy signal for each subband to compensate for
subsequent alterations of the audio signal.

8. The method of claim 6 further comprising the acts of:
    generating local maxima of a parameter of the audio signal; and
    locating a fingerprint monitoring period near a local maxima.

9. The method of claim 8 where said act of generating comprises:
    generating local maxima of the energy content of the audio signal.

10. A computer program product comprising:
    a computer readable storage medium having computer program code
embodied therein for forming a fingerprint for identifying an audio file, said computer
program code comprising:
    program code for causing a computing system to select a set of frequency
subbands of said audio signal, with each frequency having a selected frequency range;
    for each subband, program code for causing a computing system to
generate subband energy signal having a magnitude, in decibels (dB), equal to signal
energy in the subband;
    program code for causing a computing system to form an energy flux
signal for each subband having a magnitude equal to the difference between subband
energy signals of adjacent frames;
    program code for causing a computing system to determine the magnitude
of frequency components bins of the energy flux signal for each subband; and
    program code for causing a computing system to form a fingerprint
comprising the magnitudes of the frequency component bins of the energy flux signal for
all subbands.
FIG. 2